Abstract

We present a novel response generation system that can be trained end to
end on large quantities of unstructured Twitter conversations. A neural
network architecture is used to address sparsity issues that arise when
integrating contextual information into classic statistical models,
allowing the system to take into account previous dialog utterances. Our
dynamic-context generative models show consistent gains over both
context-sensitive and non-context-sensitive Machine Translation and
Information Retrieval baselines.


1 Introduction 


Until recently, the goal of training open-domain conversational systems
that emulate human conversation has seemed elusive. However, the vast
quantities of conversational exchanges now available on social media
websites such as Twitter and Reddit raise the prospect of building
data-driven models that can begin to communicate conversationally. The
work of Ritter et al. (2011), for example, demonstrates that a response
generation system can be constructed from Twitter conversations using
statistical machine translation techniques, where a status post by a
Twitter user is “translated” into a plausible looking response.

Figure 1: Example of three consecutive utterances occurring between two
Twitter users A and B.


However, an approach such as that presented in Ritter et al. (2011) does
not address the challenge of generating responses that are sensitive to
the context of the conversation. Broadly speaking, context may be
linguistic or involve grounding in the physical or virtual world, but we
here focus on linguistic context. The ability to take into account
previous utterances is key to building dialog systems that can keep
conversations active and engaging. Figure 1 illustrates a typical Twitter
dialog where the contextual information is crucial: the phrase “good luck”
is plainly motivated by the reference to “your game” in the first
utterance. In the MT model, such contextual sensitivity is difficult to
capture; moreover, naive injection of context information would entail
unmanageable growth of the phrase table at the cost of increased sparsity,
and skew towards rarely-seen context pairs. In most statistical approaches
to machine translation, phrase pairs do not share statistical weights
regardless of their intrinsic semantic commonality.

We propose to address the challenge of context-sensitive response
generation by using continuous representations or embeddings of words and
phrases to compactly encode semantic and syntactic similarity. We argue
that embedding-based models afford flexibility to model the transitions
between consecutive utterances and to capture long-span dependencies in a
domain where traditional word and phrase alignment is difficult (Ritter et
al., 2011). To this end, we present two simple, context-sensitive
response-generation models utilizing the Recurrent Neural Network Language
Model (RLM) architecture of Mikolov et al. (2010). These models first
encode past information in a hidden continuous representation, which is
then decoded by the RLM to promote plausible responses that are
simultaneously fluent and contextually relevant. Unlike typical complex
task-oriented multi-modular dialog systems (Young, 2002; Stent and
Bangalore, 2014), our architecture is completely data-driven and can
easily be trained end-to-end using unstructured data without requiring
human annotation, scripting, or automatic parsing.

This paper makes the following contributions. We present a neural network
architecture for response generation that is both context-sensitive and
data-driven. As such, it can be trained from end to end on massive amounts
of social media data. To our knowledge, this is the first application of a
neural-network model to open-domain response generation, and we believe
that the present work will lay groundwork for more complex models to come.
We additionally introduce a novel multi-reference extraction technique
that shows promise for automated evaluation.

2 Related Work 


Our work naturally lies in the path opened by Ritter et al. (2011), but we
generalize their approach by exploiting information from a larger context.
Ritter et al. and our work represent a radical paradigm shift from other
work in dialog. More traditional dialog systems typically tease apart
dialog management (Young, 2002) from response generation (Stent and
Bangalore, 2014), while our holistic approach can be considered a first
attempt to accomplish both tasks jointly. While there are previous uses of
machine learning for response generation (Walker et al., 2003), dialog
state tracking (Young et al., 2010), and user modeling (Georgila et al.,
2006), many components of typical dialog systems remain hand-coded: in
particular, the labels and attributes defining dialog states. In contrast,
the dialog state in our neural network model is completely latent and
directly optimized towards end-to-end performance. In this sense, we
believe the framework of this paper is a significant milestone towards
more data-driven and less hand-coded dialog processing.

Continuous representations of words and phrases estimated by neural
network models have been applied on a variety of tasks ranging from
Information Retrieval (IR) (Huang et al., 2013; Shen et al., 2014), Online
Recommendation (Gao et al., 2014b), Machine Translation (MT) (Auli et al.,
2013; Cho et al., 2014; Kalchbrenner and Blunsom, 2013; Sutskever et al.,
2014), and Language Modeling (LM) (Bengio et al., 2003; Collobert and
Weston, 2008). Gao et al. (2014a) successfully use an embedding model to
refine the estimation of rare phrase-translation probabilities, which is
traditionally affected by sparsity problems. Robustness to sparsity is a
crucial property of our method, as it allows us to capture context
information while avoiding unmanageable growth of model parameters.

Our work extends the Recurrent Neural Network Language Model (RLM) of
Mikolov et al. (2010), which uses continuous representations to estimate a
probability function over natural language sentences. We propose a set of
conditional RLMs where contextual information (i.e., past utterances) is
encoded in a continuous context vector to help generate the response. Our
models differ from most previous work in the way the context vector is
constructed. For example, Mikolov and Zweig (2012) and Auli et al. (2013)
use a pre-trained topic model. In our models, the context vector is
learned along with the conditional RLM that generates the response.
Additionally, the learned context encodings do not exclusively capture
contentful words. Indeed, even “stop words” can carry discriminative power
in this task; for example, all words in the utterance “how are you?” are
commonly characterized as stop words, yet this is a contentful dialog
utterance.

3 Recurrent Language Model 

We give a brief overview of the Recurrent Language Model (RLM) (Mikolov et
al., 2010) architecture that our models extend. An RLM is a generative
model of sentences, i.e., given a sentence $s = s_1, \ldots, s_T$, it
estimates:

$P(s_1, \ldots, s_T) = \prod_{t=1}^{T} P(s_t \mid s_1, \ldots, s_{t-1})$.    (1)

The model architecture is parameterized by three weight matrices,
$\Theta_{RNN} = \langle W_{in}, W_{out}, W_{hh} \rangle$: an input matrix
$W_{in}$, a recurrent matrix $W_{hh}$ and an output matrix $W_{out}$,
which are usually initialized randomly. The rows of the input matrix
$W_{in} \in \mathbb{R}^{V \times K}$ contain the $K$-dimensional
embeddings for each word in the language vocabulary of size $V$. Let us
denote by $s_t$ both the vocabulary token and its one-hot representation,
i.e., a zero vector of dimensionality $V$ with a 1 corresponding to the
index of the $s_t$ token. The embedding for $s_t$ is then obtained by
$s_t^\top W_{in}$. The recurrent matrix $W_{hh} \in \mathbb{R}^{K \times K}$
keeps a history of the subsequence that has already been processed. The
output matrix $W_{out} \in \mathbb{R}^{K \times V}$ projects the hidden
state $h_t$ into the output layer $o_t$, which has an entry for each word
in the vocabulary $V$. This value is used to generate a probability
distribution for the next word in the sequence. Specifically, the forward
pass proceeds with the following recurrence, for $t = 1, \ldots, T$:

$h_t = \sigma(s_t^\top W_{in} + h_{t-1}^\top W_{hh}), \quad o_t = h_t^\top W_{out}$    (2)

where $\sigma$ is a non-linear function applied element-wise, in our case
the logistic sigmoid. The recurrence is seeded by setting $h_0 = 0$, the
zero vector. The probability distribution over the next word given the
previous history is obtained by applying the softmax activation function:


$P(s_t = w \mid s_1, \ldots, s_{t-1}) = \frac{\exp(o_{tw})}{\sum_{v=1}^{V} \exp(o_{tv})}$.    (3)


The RLM is trained to minimize the negative log-likelihood of the training
sentence $s$:

$L(s) = -\sum_{t=1}^{T} \log P(s_t \mid s_1, \ldots, s_{t-1})$.    (4)


The recurrence is unrolled backwards in time using the back-propagation
through time (BPTT) algorithm (Rumelhart et al., 1988), and gradients are
accumulated over multiple time-steps.
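
The forward pass and training objective of Eqs. 2-4 can be illustrated
with a small sketch. It is not the authors' implementation: the toy sizes,
the random initialization and the helper names (`vec_mat`, `rlm_nll`,
`random_matrix`) are assumptions made for illustration, and the sketch
predicts each word from the hidden state that precedes it, so that the
first word is predicted from $h_0 = 0$.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vec_mat(v, M):
    """Row vector v (length n) times matrix M (n x m)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def softmax(o):
    """Eq. 3, shifted by max(o) for numerical stability."""
    top = max(o)
    exps = [math.exp(x - top) for x in o]
    total = sum(exps)
    return [e / total for e in exps]


def rlm_nll(sentence, W_in, W_hh, W_out):
    """Negative log-likelihood (Eq. 4) of a sentence given as word ids."""
    K = len(W_hh)
    h = [0.0] * K                            # h_0 = 0
    nll = 0.0
    for word in sentence:
        probs = softmax(vec_mat(h, W_out))   # distribution over the next word
        nll -= math.log(probs[word])
        rec = vec_mat(h, W_hh)
        # Eq. 2: a one-hot s_t times W_in selects row `word` of W_in.
        h = [sigmoid(W_in[word][k] + rec[k]) for k in range(K)]
    return nll


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
V, K = 5, 3                                  # toy sizes, not the paper's
W_in = random_matrix(V, K, rng)
W_hh = random_matrix(K, K, rng)
W_out = random_matrix(K, V, rng)
print(rlm_nll([0, 2, 4], W_in, W_hh, W_out))
```

With all weights set to zero the softmax is uniform, so the loss of a
three-word sentence reduces to $3 \log V$, which is a convenient sanity
check for an implementation.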

Figure 2: Compact representation of an RLM (left) and unrolled
representation for two time steps (right).


4 Context-Sensitive Models 


We distinguish three linguistic entities in a conversation between two
users A and B: the context¹ c, the message m and response r. The context c
represents a sequence of past dialog exchanges of any length; then B emits
a message m to which A reacts by formulating its response r (see Figure 1).

We use three context-based generation models to estimate the probability
of the response $r = r_1, \ldots, r_T$, conditioned on past information c
and m:

$p(r \mid c, m) = \prod_{t=1}^{T} p(r_t \mid r_1, \ldots, r_{t-1}, c, m)$.    (5)

These three models differ in the manner in which 
they compose the context-message pair (c, m). 


4.1 Tripled Language Model 


In our first model, dubbed RLMT, we straightforwardly concatenate each
utterance c, m, r into a single sentence s and train the RLM to minimize
L(s). Given c and m, we compute the probability of the response as
follows: we perform the forward propagation over the known utterances c
and m to obtain a hidden state encoding useful information about previous
utterances. Subsequently, we compute the likelihood of the response from
that hidden state.

An issue with this simple approach is that the concatenated sentence s
will be very long on average, especially if the context comprises multiple
utterances. Modelling such long-range dependencies with an RLM is
difficult and is still considered an open problem (Pascanu et al., 2013).
We will consider RLMT as an additional context-sensitive baseline for the
models we present next.

¹ In this work, the context is purely linguistic, but future work might
integrate further contextual information, e.g., geographical location,
time information, or other forms of grounding.

Figure 3: Compact representations of DCGM-I (left) and DCGM-II (right).
The decoder RLM receives a bias from the context encoder. In DCGM-I, we
encode the bag-of-words representation of both c and m in a single vector
$b_{cm}$. In DCGM-II, we concatenate the representations $b_c$ and $b_m$
on the first layer to preserve order information.


4.2 Dynamic-Context Generative Model I 


The above limitation of RLMT can be addressed by strengthening the context
bias. In our second model (DCGM-I), the context and the message are
encoded into a fixed-length vector representation that is used by the RLM
to decode the response. This is illustrated in Figure 3 (left). First, we
consider c and m as a single sentence and compute a single bag-of-words
representation $b_{cm} \in \mathbb{R}^V$. Then, $b_{cm}$ is provided as
input to a multilayered non-linear forward architecture that produces a
fixed-length representation that is used to bias the recurrent state of
the decoder RLM. At training time, both the context encoder and the RLM
decoder are learned so as to minimize the negative log-probability of the
generated response.

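The parameters of the model are $\Theta_{\text{DCGM-I}} = \langle W_{in},
W_{hh}, W_{out}, \{W_f^\ell\}_{\ell=1}^{L} \rangle$, where
$\{W_f^\ell\}_{\ell=1}^{L}$ are the weights for the $L$ layers of the
feed-forward context network. The fixed-length context vector $k_L$ is
obtained by forward propagation of the network:

$k_1 = b_{cm}^\top W_f^1$
$k_\ell = \sigma(k_{\ell-1}^\top W_f^\ell)$ for $\ell = 2, \ldots, L$    (6)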


The rows of $W_f^1$ contain the embeddings of the vocabulary.² These are
different from those employed in the RLM and play a crucial role in
promoting the specialization of the context encoder to a distinct task.
The hidden layer of the decoder RLM takes the following form:

$h_t = \sigma(h_{t-1}^\top W_{hh} + k_L + s_t^\top W_{in})$    (7a)

$o_t = h_t^\top W_{out}$    (7b)

$p(s_{t+1} \mid s_1, \ldots, s_t, c, m) = \mathrm{softmax}(o_t)$    (7c)

This model conditions on the previous utterances via biasing the hidden
layer state on the context representation $k_L$. Note that the context
representation does not change through time. This is useful because: (a)
it forces the context encoder to produce a representation general enough
to be useful for generating all words in the response and (b) it helps the
RLM decoder to remember context information when generating long
responses.
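
The encoder of Eq. 6 and the biased recurrence of Eq. 7a can be sketched
as follows. This is a minimal illustration rather than the released code:
the bag-of-words vector is assumed to hold word counts, and the toy layer
sizes and constant weights are hypothetical values chosen so the result
can be checked by hand.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vec_mat(v, M):
    """Row vector v (length n) times matrix M (n x m)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def bag_of_words(word_ids, V):
    """Count vector of size V (assumption: counts, not binary indicators)."""
    b = [0.0] * V
    for w in word_ids:
        b[w] += 1.0
    return b


def encode_context(b_cm, layers):
    """Eq. 6: linear first layer, sigmoid on layers 2..L."""
    k = vec_mat(b_cm, layers[0])
    for W in layers[1:]:
        k = [sigmoid(x) for x in vec_mat(k, W)]
    return k


def decoder_step(h_prev, word, k_L, W_in, W_hh):
    """Eq. 7a: the same context vector k_L is added at every time step."""
    rec = vec_mat(h_prev, W_hh)
    return [sigmoid(rec[j] + k_L[j] + W_in[word][j]) for j in range(len(h_prev))]


V, K = 4, 2                                   # toy vocabulary and hidden sizes
b_cm = bag_of_words([0, 1, 1, 3], V)          # word ids of c and m together
layers = [[[0.1] * 3 for _ in range(V)],      # W_f^1: V x 3, linear
          [[0.2] * K for _ in range(3)]]      # W_f^2: 3 x K, followed by sigmoid
k_L = encode_context(b_cm, layers)
W_in = [[0.0] * K for _ in range(V)]
W_hh = [[0.0] * K for _ in range(K)]
h_1 = decoder_step([0.0] * K, 2, k_L, W_in, W_hh)
```

Because `k_L` is computed once and reused in every call to `decoder_step`,
the context keeps influencing the hidden state however long the response
grows, which is point (b) above.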


4.3 Dynamic-Context Generative Model II 


Because DCGM-I does not distinguish between c and m, that model has the
propensity to underestimate the strong dependency that holds between m and
r. Our third model (DCGM-II) addresses this issue by concatenating the two
linear mappings of the bag-of-words representations $b_c$ and $b_m$ in the
input layer of the feed-forward network representing c and m (see Figure 3
right). Concatenating continuous representations prior to deep
architectures is a common strategy to obtain order-sensitive
representations (Bengio et al., 2003; Devlin et al., 2014).


The forward equations for the context encoder are:

$k_1 = [b_c^\top W_f^1, \; b_m^\top W_f^1]$
$k_\ell = \sigma(k_{\ell-1}^\top W_f^\ell)$ for $\ell = 2, \ldots, L$    (8)


where [x, y] denotes the concatenation of x and y vectors. In DCGM-II, the
bias on the recurrent hidden state and the probability distribution over
the next token are computed as described in Eq. 7.
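
A short sketch of Eq. 8 shows why the concatenation is order-sensitive:
swapping context and message changes the first-layer output. The identity
embedding matrix and the two-word vocabulary are toy assumptions, and
`encode_context_ii` is a hypothetical helper name.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vec_mat(v, M):
    """Row vector v (length n) times matrix M (n x m)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def encode_context_ii(b_c, b_m, W_first, upper_layers):
    """Eq. 8: k_1 concatenates the two linear maps; layers 2..L apply a sigmoid."""
    k = vec_mat(b_c, W_first) + vec_mat(b_m, W_first)   # list concatenation
    for W in upper_layers:
        k = [sigmoid(x) for x in vec_mat(k, W)]
    return k


W_first = [[1.0, 0.0], [0.0, 1.0]]   # toy 2-word vocabulary, identity embeddings
b_c, b_m = [1.0, 0.0], [0.0, 1.0]
print(encode_context_ii(b_c, b_m, W_first, []))   # [1.0, 0.0, 0.0, 1.0]
print(encode_context_ii(b_m, b_c, W_first, []))   # [0.0, 1.0, 1.0, 0.0]
```

In DCGM-I the two calls would receive the same summed bag-of-words vector
and could not be told apart. Note that the second layer must accept an
input twice as wide as the first-layer embedding.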


² Notice that the first layer of the encoder network is linear. We found
that this helps learning the embedding matrix as it reduces the vanishing
gradient effect partially due to stacking of squashing non-linearities
(Pascanu et al., 2013).


























5 Experimental Setting 

5.1 Dataset Construction

For computational efficiency and to alleviate the burden of human
evaluators, we restrict the context sequence c to a single sentence.
Hence, our dataset is composed of “triples” $\tau = (c_\tau, m_\tau,
r_\tau)$ consisting of three sentences. We mined 127M
context-message-response triples from the Twitter FireHose, covering the
3-month period June 2012 through August 2012. Only those triples where
context and response were generated by the same user were extracted. To
minimize noise, we selected triples that contained at least one frequent
bigram that appeared more than 3 times in the corpus. This produced a
corpus of 29M Twitter triples. Additionally, we hired crowdsourced raters
to evaluate approximately 33K candidate triples. Judgments on a 5-point
scale were obtained from 3 raters apiece. This yielded a set of 4232
triples with a mean score of 4 or better that was then randomly binned
into a tuning set of 2118 triples and a test set of 2114 triples.³ The
mean length of responses in these sets was approximately 11.5 tokens,
after cleanup (e.g., stripping of emoticons), including punctuation.
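
The frequent-bigram filter can be sketched as below. The text does not say
how tokens are split or whether all three utterances contribute to the
counts, so whitespace tokenization and counting over every utterance are
assumptions, as is the helper name `filter_triples`.

```python
from collections import Counter


def bigrams(text):
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))


def filter_triples(triples, min_count=3):
    """Keep triples with at least one bigram seen more than min_count times."""
    counts = Counter(bg for triple in triples for utt in triple for bg in bigrams(utt))
    return [
        triple for triple in triples
        if any(counts[bg] > min_count for utt in triple for bg in bigrams(utt))
    ]


# Toy corpus: one triple repeated four times plus one triple of rare bigrams.
corpus = [("are you ready", "good luck today", "thank you so much")] * 4
corpus.append(("xq zv", "pl mk", "wj ty"))
kept = filter_triples(corpus)
print(len(kept))   # 4
```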


5.2 Automatic Evaluation 


We evaluate all systems using BLEU (Papineni et al., 2002) and METEOR
(Banerjee and Lavie, 2005), and supplement these results with more
targeted human pairwise comparisons in Section 6.3. A major challenge in
using these automated metrics for response generation is that the set of
reasonable responses in our task is potentially vast and extremely
diverse. The dataset construction method just described yields only a
single reference for each status. Accordingly, we extend the set of
references using an IR approach to mine potential responses, after which
we have human judges rate their appropriateness. As we see in Section 6.3,
it turns out that by optimizing systems towards BLEU using mined
multi-references, BLEU rankings align well with human judgments. This lays
groundwork for interesting future correlation studies.


³ The Twitter ids of the tuning and test sets along with the code for the
neural network models may be obtained from
http://research.microsoft.com/convo/

Corpus    # Triples    Avg # Ref    [Min, Max] # Ref
Tuning    2118         3.22         [1, 10]
Test      2114         3.58         [1, 10]

Table 1: Number of triples, average, minimum and maximum number of
references for tuning and test corpora.

Multi-reference extraction We use the following algorithm to better cover
the space of reasonable responses. Given a test triple $\tau = (c_\tau,
m_\tau, r_\tau)$, our goal is to mine other responses
$\{r_{\tilde\tau}\}$ that fit the context and message pair $(c_\tau,
m_\tau)$. To this end, we first select a set of 15 candidate triples
$\{\tilde\tau\}$ using an IR system. The IR system is calibrated in order
to select candidate triples $\tilde\tau$ for which both the message
$m_{\tilde\tau}$ and the response $r_{\tilde\tau}$ are similar to the
original message $m_\tau$ and response $r_\tau$. Formally, the score of a
candidate triple is:

$s(\tilde\tau, \tau) = d(m_{\tilde\tau}, m_\tau) \, (\alpha \, d(r_{\tilde\tau}, r_\tau) + (1 - \alpha) \epsilon)$,    (9)

where $d$ is the bag-of-words BM25 similarity function (Robertson et al.,
1995), $\alpha$ controls the impact of the similarity between the
responses and $\epsilon$ is a smoothing factor that avoids zero scores for
candidate responses that do not share any words with the reference
response. We found that this simple formula provided references that were
both diverse and plausible. Given a set of candidate triples
$\{\tilde\tau\}$, human evaluators are asked to rate the quality of the
response within the new triples $\{(c_\tau, m_\tau, r_{\tilde\tau})\}$.
After human evaluation, we retain the references for which the score is 4
or better on a 5 point scale, resulting in 3.58 references per example on
average (Table 1). The average lengths for the responses in the
multi-reference tuning and test sets are 8.75 and 8.13 tokens
respectively.
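
The candidate score of Eq. 9 can be sketched as follows. The similarity
$d$ in the paper is BM25; to keep the example self-contained a Jaccard
word overlap is swapped in, and the values of `alpha` and `eps` are
hypothetical since the text does not report them.

```python
def word_overlap(a, b):
    """Stand-in for d: Jaccard overlap of word sets (the paper uses BM25)."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def candidate_score(cand, ref, alpha=0.5, eps=0.1, d=word_overlap):
    """Eq. 9 for triples given as (context, message, response) tuples."""
    _, cand_m, cand_r = cand
    _, ref_m, ref_r = ref
    return d(cand_m, ref_m) * (alpha * d(cand_r, ref_r) + (1 - alpha) * eps)


def top_candidates(candidates, ref, n=15, **kwargs):
    """Return the n highest-scoring candidate triples for one test triple."""
    return sorted(candidates, key=lambda t: candidate_score(t, ref, **kwargs),
                  reverse=True)[:n]


ref = ("see you at the game", "good luck today", "thank you")
same = ("hi", "good luck today", "thank you")
new_reply = ("hi", "good luck today", "see ya")
off_topic = ("hi", "hello there", "thank you")
ranked = top_candidates([off_topic, new_reply, same], ref)
```

The smoothing term keeps `new_reply` above zero even though its response
shares no word with the reference, which is what allows diverse responses
to surface, while a candidate whose message does not match at all
(`off_topic`) scores zero.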


5.3 Feature Sets 


The response generation systems evaluated in this paper are parameterized
as log-linear models in a framework typical of statistical machine
translation (Och and Ney, 2004). These log-linear models comprise the
following feature sets:

MT MT features are derived from a large response generation system built
along the lines of Ritter et al. (2011), which is based on a phrase-based
MT decoder similar to Moses (Koehn et al., 2007). Our MT feature set
includes the following features that are common in Moses: forward and
backward maximum likelihood “translation” probabilities, word and phrase
penalties, linear distortion, and a modified Kneser-Ney language model
(Kneser and Ney, 1995) trained on Twitter responses. For the translation
probabilities, we built a very large phrase table of 160.7 million entries
by first filtering out Twitterisms (e.g., long sequences of vowels,
hashtags), and then selecting candidate phrase pairs using Fisher’s exact
test (Ritter et al., 2011). We also included MT decoder features
specifically motivated by the response generation task: Jaccard distance
between source and target phrase, Fisher’s exact probability, and a score
relating the lengths of source and target phrases.

IR We also use an IR feature built from an index of triples, whose
implementation roughly matches the $\mathrm{IR}_{\mathrm{status}}$
approach described in Ritter et al. (2011): For a test triple $\tau$, we
choose $r_{\tilde\tau}$ as the candidate response iff $\tilde\tau =
\arg\max_{\tilde\tau} d(m_\tau, m_{\tilde\tau})$.

CMM Neither MT nor IR traditionally take into account contextual
information. Therefore, we take into consideration context and message
matches (CMM), i.e., exact matches between c, m and r. We define 8
features as the [1-4]-gram matches between c and the candidate reply r and
the [1-4]-gram matches between m and the candidate reply r. These exact
matches help capture and promote contextual information in the replies.

RLMT, DCGM-I, DCGM-II We consider the RLM trained on the concatenated
triples, denoted as RLMT (Section 4.1), to be a context-sensitive RLM
baseline. Each neural network model contributes an additional feature
corresponding to the likelihood of the candidate response given context
and message.

System    BLEU
RANDOM    0.33
MT        3.21
HUMAN     6.08

Table 2: Multi-reference corpus-level BLEU obtained by leaving one
reference out at random.

5.4 Model Training

The proposed models are trained on a 4M subset of the triple data. The
vocabulary consists of the most frequent V = 50K words. In order to speed
up training, we use the Noise-Contrastive Estimation (NCE) loss, which
avoids repeated summations over V by approximating the probability of the
target word (Gutmann and Hyvärinen, 2010). Parameter optimization is done
using Adagrad (Duchi et al., 2011) with a mini-batch size of 100 and a
learning rate $\alpha = 0.1$, which we found to work well on held-out
data. In order to stabilize learning, we clip the gradients to a fixed
range [−10, 10], as suggested in Mikolov et al. (2010). All the parameters
of the neural models are sampled from a normal distribution
$\mathcal{N}(0, 0.01)$ while the recurrent weight $W_{hh}$ is initialized
as a random orthogonal matrix and scaled by 0.01. To prevent over-fitting,
we evaluate performance on a held-out set during training and stop when
the objective increases. The size of the RLM hidden layer is set to
K = 512, where the context encoder is a 512, 256, 512 multilayer network.
The bottleneck in the middle compresses context information that leads to
similar responses and thus achieves better generalization. The last layer
embeds the context vector into the hidden space of the decoder RLM.
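
The optimization settings reported for training (Adagrad with a learning
rate of 0.1 and gradients clipped to [−10, 10]) can be sketched on a flat
list of parameters. This is an illustration only: the stabilizing constant
`eps` is an assumption not stated in the text, and the NCE loss and
mini-batching are left out.

```python
import math


def clip(g, bound=10.0):
    """Clip a gradient component to the fixed range [-bound, bound]."""
    return max(-bound, min(bound, g))


def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8, bound=10.0):
    """One in-place Adagrad update; accum holds running sums of squared gradients."""
    for i, g in enumerate(grads):
        g = clip(g, bound)
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)


params, accum = [1.0, -2.0], [0.0, 0.0]
adagrad_step(params, [50.0, -0.5], accum)    # the first gradient is clipped to 10
```

On the first step each parameter moves by roughly the learning rate
regardless of gradient magnitude; later steps shrink as squared gradients
accumulate.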


5.5 Rescoring Setup 

We evaluate the proposed models by rescoring the n-best candidate
responses obtained using the MT phrase-based decoder and the IR system. In
contrast to MT, the candidate responses provided by IR have been created
by humans and are less affected by fluency issues. The different n-best
lists will provide a comprehensive testbed for our experiments. First, we
augment the n-best list of the tuning set with the scores of the model of
interest. Then, we run an iteration of MERT (Och, 2003) to estimate the
log-linear weights of the new features. At test time, we rescore the test
n-best list with the new weights.
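
The rescoring step can be sketched as a weighted sum of feature values per
candidate. The feature names, scores and weights below are invented for
illustration; in the paper the weights come from MERT, which is not
reproduced here.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values; features without a weight contribute 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rescore(nbest, weights):
    """Sort (response, features) pairs by log-linear score, best first."""
    return sorted(nbest, key=lambda item: loglinear_score(item[1], weights),
                  reverse=True)


# Hypothetical 2-best list with a baseline feature and a neural log-likelihood.
nbest = [
    ("i am fine", {"mt": -2.0, "dcgm": -9.0}),
    ("good luck with your game", {"mt": -2.5, "dcgm": -4.0}),
]
baseline = rescore(nbest, {"mt": 1.0})
augmented = rescore(nbest, {"mt": 1.0, "dcgm": 0.5})
print(baseline[0][0])    # i am fine
print(augmented[0][0])   # good luck with your game
```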


6 Results 


6.1 Lower and Upper Bounds 

Table 2 shows the expected upper and lower bounds for this task as
suggested by BLEU scores for human responses and a random response
baseline. The RANDOM system comprises responses randomly extracted from
the triples corpus. HUMAN is computed by choosing one reference amongst
the multi-reference set for each context-status pair.⁴ Although the scores
are lower than those usually reported in SMT tasks, the ranking of the
three systems is unambiguous.

⁴ For the human score, we compute corpus-level BLEU with a sampling scheme
that randomly leaves out one reference (the human sentence to score) for
each reference set. This sampling scheme (repeated with 100 trials) is
also applied for the MT and RANDOM system so as to make BLEU scores
comparable.

MT n-best                BLEU (%)       METEOR (%)     | IR n-best                BLEU (%)       METEOR (%)
MT 9 feat.               3.60 (-9.5%)   9.19 (-0.9%)   | IR 2 feat.               1.51 (-55%)    6.25 (-22%)
CMM 9 feat.              3.33 (-16%)    9.34 (+0.7%)   | CMM 9 feat.              3.39 (-0.6%)   8.20 (+0.6%)
▷ MT + CMM 17 feat.      3.98 (-)       9.28 (-)       | ▷ IR + CMM 10 feat.      3.41 (-)       8.04 (-)
RLMT 2 feat.             4.13 (+3.7%)   9.54 (+2.7%)   | RLMT 2 feat.             2.85 (-16%)    7.38 (-8.2%)
DCGM-I 2 feat.           4.26 (+7.0%)   9.55 (+2.9%)   | DCGM-I 2 feat.           3.36 (-1.5%)   7.84 (-2.5%)
DCGM-II 2 feat.          4.11 (+3.3%)   9.45 (+1.8%)   | DCGM-II 2 feat.          3.37 (-1.1%)   8.22 (+2.5%)
DCGM-I + CMM 10 feat.    4.44 (+11%)    9.60 (+3.5%)   | DCGM-I + CMM 10 feat.    4.07 (+19%)    8.67 (+7.8%)
DCGM-II + CMM 10 feat.   4.38 (+10%)    9.62 (+3.5%)   | DCGM-II + CMM 10 feat.   4.24 (+24%)    8.61 (+7.1%)

Table 3: Context-sensitive ranking results on both MT (left) and IR
(right) n-best lists, n = 1000. The subscript feat. indicates the number
of features of the models. The log-linear weights are estimated by running
one iteration of MERT. We mark by (±%) the relative improvements with
respect to the reference system (▷).

6.2 BLEU and METEOR

The results of automatic evaluation using BLEU and METEOR are presented in
Table 3, where some broad patterns emerge. First, both metrics indicate
that a phrase-based MT decoder outperforms a purely IR approach. Second,
adding CMM features to the baseline systems helps. Third, the neural
network models contribute measurably to improvement: RLMT and DCGM models
outperform baselines, and DCGM models provide more consistent gains than
RLMT.

MT vs. IR BLEU and METEOR scores indicate that the phrase-based MT decoder
outperforms a purely IR approach, despite the fact that IR proposes fluent
human generated responses. This may be because the IR model only loosely
captures important patterns between message and response: It ranks
candidate responses solely by the similarity of their message with the
message of the test triple (§5.3). As a result, the top ranked response is
likely to drift from the purpose of the original conversation. The MT
approach, by contrast, more directly models statistical patterns between
message and response.

CMM MT+CMM, totaling 17 features (9 from MT + 8 CMM), improves 0.38 BLEU
points, a 9.5% relative improvement, over the baseline MT model. IR+CMM,
with 10 features (IR + word penalty + 8 CMM), benefits even more,
attaining 1.8 BLEU points and 1.5 METEOR points over the IR baseline.
Figures 4(a) and 4(b) plot the magnitude of the learned CMM feature
weights for MT+CMM and IR+CMM. CMM features help in both these hypothesis
spaces and especially on the IR n-best list. Figure 4(b) supports the
hypothesis formulated in the previous paragraph: Since IR solely captures
inter-message similarities, the matches between message and response are
important, while context matches help in providing additional gains. The
phrase-based statistical patterns captured by the MT system do a good job
in explaining away 1-gram and 2-gram message matches (Figure 4(a)) and the
performance gain mainly comes from context matches. On the other hand, we
observe that 4-gram matches may be important in selecting appropriate
responses. Inspection of the tuning set reveals instances where responses
contain long subsequences of their corresponding messages, e.g., m = “good
night best friend, I love you”, r = “I love you too, good night best
friend”. Although infrequent, such higher-order n-gram matches, when they
occur, may provide a more robust signal of the quality of the response
than 1- and 2-gram matches, given the highly conversational nature of our
dataset.
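
The eight CMM features can be illustrated on the message and response
quoted above. The text does not define how a "match" is counted, so this
sketch assumes one count per n-gram of the reply that also occurs in the
source utterance, with lowercasing and punctuation stripping as a further
assumption.

```python
import re


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def match_counts(source, reply, max_n=4):
    """Number of reply n-grams (n = 1..max_n) that also occur in source."""
    src, rep = tokenize(source), tokenize(reply)
    counts = []
    for n in range(1, max_n + 1):
        seen = set(ngrams(src, n))
        counts.append(sum(1 for gram in ngrams(rep, n) if gram in seen))
    return counts


def cmm_features(c, m, r):
    """Four context-reply counts followed by four message-reply counts."""
    return match_counts(c, r) + match_counts(m, r)


m = "good night best friend, I love you"
r = "I love you too, good night best friend"
print(cmm_features("", m, r))   # [0, 0, 0, 0, 7, 5, 3, 1]
```

The single 4-gram match is the shared span "good night best friend", the
kind of long overlap the paragraph above singles out as a robust signal.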

RLMT and DCGM Both RLMT and DCGM models outperform their respective MT and
IR baselines. Both models also exhibit similar performance and show
improvements over the MT+CMM models, albeit using a lower dimensional
feature space. We believe that their similar performance is due to the
limited diversity of MT n-best list together with gains in fluency
stemming from the strong language model provided by the RLM. In the case
of IR models, on the other hand, there is more headroom for improvement
and fluency is already guaranteed. Any gains must come from context and
message matches. Hence, RLMT underperforms with respect to both DCGM and
IR+CMM. The DCGM models appear to have better capacity to retain
contextual information and thus achieve similar performance to IR+CMM
despite their lack of exact n-gram match information.

In the present experimental setting, no striking performance difference
can be observed between the two versions of the DCGM architecture. If
multiple sequences were used as context, we expect that the DCGM-II model
would likely benefit more owing to the separate encoding of message and
context.

DCGM+CMM We also investigated whether mixing exact CMM n-gram overlap with
semantic information encoded by the DCGM models can bring additional
gains. DCGM-{I-II}+CMM systems each totaling 10 features show increases of
up to 0.48 BLEU points over MT+CMM and up to 0.88 BLEU over the model
based on Ritter et al. (2011). METEOR improvements similarly align with
BLEU improvements both for MT and IR lists. We take this as evidence that
CMM exact matches and DCGM semantic matches interact positively, a finding
that comports with Gao et al. (2014a), who show that semantic
relationships mined through phrase embeddings correlate positively with
classic co-occurrence-based estimations. Analysis of CMM feature weights
in Figure 4(c) and (d) suggests that 1-gram matches are explained away by
the DCGM model, but that higher order matches are important. It appears
that DCGM models might be improved by preserving word-order information in
context and message encodings.

Figure 4: Comparison of the weights of learned CMM features for MT+CMM and
IR+CMM systems (a) and (b), and DCGM-II+CMM on MT and IR (c) and (d).

System A       System B    Gain (%)    CI
HUMAN          MT+CMM      13.6*       [12.4, 14.8]
DCGM-II        MT          1.9*        [0.8, 2.9]
DCGM-II+CMM    MT          3.1*        [2.0, 4.3]
DCGM-II+CMM    MT+CMM      1.5*        [0.5, 2.5]

DCGM-II        IR          5.2*        [4.0, 6.4]
DCGM-II+CMM    IR          5.3*        [4.1, 6.6]
DCGM-II+CMM    IR+CMM      2.3*        [1.2, 3.4]

Table 4: Pairwise human evaluation scores between System A and B. The
first (second) set of results refer to the MT (IR) hypothesis list. The
asterisk means agreement between human preference and BLEU rankings.



6.3 Human Evaluation 

Human evaluation was conducted using crowdsourced annotators. Annotators
were asked to compare the quality of system output responses pairwise
(“Which is better?”) in relation to the context and message strings in the
2114 item test set. Identical strings were held out, so that the
annotators only saw those outputs that differed. Paired responses were
presented in random order to the annotators, and each pair of responses
was judged by 5 annotators.

Table 4 summarizes the results of human evaluation, giving the difference
in mean scores (pairwise preference margin) between systems and 95%
confidence intervals generated using Welch’s t-test. Identical strings not
shown to raters are incorporated with an automatically assigned score of
0.5. The pattern in these results is clear and consistent:
context-sensitive systems (+CMM) outperform non-context-sensitive systems,
with preference gains as high as approximately 5.3% in the case of
DCGM-II+CMM versus IR, and about 3.1% in the case of DCGM-II+CMM versus
MT. Similarly, context-sensitive DCGM systems outperform non-DCGM
context-sensitive systems by 1.5% (MT) and 2.3% (IR). These results are
consistent with the automated BLEU rankings and confirm that our best
performing DCGM models outperform both raw baseline and the
context-sensitive baseline using CMM features.
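
A rough sketch of the preference margin follows. The exact scoring
protocol is not spelled out in the text, so several assumptions are made:
each test item gives system A a score equal to the fraction of annotators
preferring it (system B receives the complement), identical outputs score
0.5 for both, and the interval uses the unequal-variance (Welch) standard
error with the normal critical value 1.96 in place of the t quantile.

```python
import math
from statistics import mean, variance


def preference_gain(scores_a, scores_b, z=1.96):
    """Difference in mean scores and an approximate 95% confidence interval."""
    gain = mean(scores_a) - mean(scores_b)
    std_err = math.sqrt(variance(scores_a) / len(scores_a)
                        + variance(scores_b) / len(scores_b))
    return gain, (gain - z * std_err, gain + z * std_err)


# Hypothetical per-item scores; the 0.5 entries stand for identical outputs.
scores_a = [0.8, 0.6, 0.5, 0.5, 1.0, 0.4]
scores_b = [1.0 - s for s in scores_a]
gain, (low, high) = preference_gain(scores_a, scores_b)
```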

6.4 Discussion 

| # | Context c | Message m | Generated Response r |
|---|---|---|---|
| 1 | noone can help this time its just i need a support system | well im here to support you. whenever you need me. | thank you, i appreciate it thank you |
| 2 | nothin much, and how’s the book?! | its good but i’m only like halfway through cuz i don’t feel like reading, i’m so bored ... | that’s good! i have the book but i’m bored too. |
| 3 | i know! what have you been up to? | nothing really, just enjoying summer and working, what about you? | that’s good! nothing much just listening to music and you? |
| 4 | thanks, your way better though | i’m just good at expressing it that’s but trust me you are freakin amazing and waay better | i know, i don’t trust you, i love you too! thank you |
| 5 | thanks, yeah i am good thanks! good, but angry! | sorry you’re angry, been there, i hope it dies down. | i hope so too, i’m not angry!! |
| 6 | figures, bass, never ate bass. | bass is extremely mild with very few bones. my favorite after catfish | that’s one of the bones, chuck bass is mine too. |
| 7 | today i will walk for miles, clear sky clear mind. | love walking, it’s too hot to walk this month, i haven’t moved, frozen, heat freeze. | it’s too hot for you to get up to honey if you have frozen yogurt to you. |
| 8 | arresting people over “nasty tweets”? please, i’m not condoning what the lad did but arresting him is a tad silly. | it appears he may have sent death threats to others while all this was kicking off too. | i did! he was armed with nuclear threats? that’s what’s happening to you. |

Table 5: Sample responses produced by the MT-based DCGM-II+CMM system.


Table 5 provides examples of responses generated on
the tuning corpus by the MT-based DCGM-II+CMM
system, our best system in terms of both BLEU and
human evaluation. Responses from this system are on
average shorter (8.95 tokens) than the original human
responses in the tuning set (11.5 tokens). Overall, the
outputs tend to be generic or commonplace, but are
often reasonably plausible in the context as in
examples 1-3, especially where context and message
contain common conversational elements. Example 2
illustrates the impact of context-sensitivity: the word
“book” in the response is not found in the message.
Nonetheless, longer generated responses are apt to
degrade both syntactically and in terms of content.
We notice that longer responses are likely to present
information that either conflicts internally within the
response itself or is at odds with the context, as in
examples 4-5. This is not surprising, since our model
lacks mechanisms both for reflecting agent intent in
the response and for maintaining consistency with
respect to sentiment polarity. Longer context and
message components may also result in responses
that wander off-topic or lapse into incoherence as in
examples 6-8, especially when relatively low frequency
unigrams (“bass”, “threat”) are echoed in the response.
In general, we expect that larger datasets and incorporation
of more extensive contexts into the model will
help yield more coherent results in these cases.
Consistent representation of agent intent is outside the
scope of this work, but will likely remain a significant
challenge.
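The observation about Example 2 can be checked mechanically. The sketch below lists response words that appear in the context but not in the message, using the strings from Table 5. The helper names `tokens` and `context_only_words` are hypothetical, and the regular-expression tokenizer is a crude stand-in assumed for illustration, not the tokenization used in our experiments.

```python
import re


def tokens(text):
    """Lowercase word tokens; a simple stand-in tokenizer (an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def context_only_words(context, message, response):
    """Response words that occur in the context but not in the message."""
    ctx, msg = set(tokens(context)), set(tokens(message))
    return [w for w in tokens(response) if w in ctx and w not in msg]


# Example 2 from Table 5, with plain ASCII apostrophes.
context = "nothin much, and how's the book?!"
message = ("its good but i'm only like halfway through cuz i don't feel "
           "like reading, i'm so bored ...")
response = "that's good! i have the book but i'm bored too."

carried = context_only_words(context, message, response)
```

For this example `carried` contains “book”, along with the function word “the”, while “bored” is excluded because it already occurs in the message. A stop-word filter would be needed before using such counts as evidence of context-sensitivity.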


7 Conclusion 


We have formulated a neural network architecture
for data-driven response generation trained from social
media conversations, in which generation of
responses is conditioned on past dialog utterances
that provide contextual information. We have proposed
a novel multi-reference extraction technique
allowing for robust automated evaluation using standard
SMT metrics such as BLEU and METEOR.
Our context-sensitive models consistently outperform
both context-independent and context-sensitive
baselines by up to 11% relative improvement in
BLEU in the MT setting and 24% in the IR setting,
albeit using a minimal number of features. As our models
are completely data-driven and self-contained,
they hold the potential to improve fluency and contextual
relevance in other types of dialog systems.

Our work suggests several directions for future
research. We anticipate that there is much room for
improvement if we employ more complex neural network
models that take into account word order within
the message and context utterances. Direct generation
from neural network models is an interesting and
potentially promising next step. Future progress in
this area will also greatly benefit from thorough study
of automated evaluation metrics.